Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

Because experimental data is very likely contaminated by noise, the basic principle of almost all regression analysis algorithms is what is called regressing random observations to the unknown truth, i.e., the mean of the observations. All regression analysis approaches therefore have the same goal, i.e., "regress to mean". In a simple experiment, the observations are assumed to contain variation due to sample preparation error, technique error and data collection error. The error is assumed to be random and to mostly follow a Gaussian distribution. Regressing these observations will finally lead the answer to the mean of these observations. If the errors of a collected data set are assumed to follow a Gaussian distribution, the mean of these observations is taken as the estimate of the true value. The likelihood of the Gaussian errors of $N$ observed data is defined as below, where $x_n$ is the $n$-th data point, $\sigma^2$ stands for the variance of the data, and $\mu$ is the unknown truth, i.e., the mean,

$$
\mathcal{L} = (2\pi\sigma^2)^{-N/2} \prod_{n=1}^{N} \exp\left( -\frac{(x_n - \mu)^2}{2\sigma^2} \right)
$$
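As a rough illustration of the likelihood above, the following Python sketch evaluates it for a few candidate values of the mean. The function names, the five observations and the value of `sigma` are hypothetical and chosen only for illustration. The sketch assumes the standard Gaussian normalisation, which includes the variance, and it also computes the log-likelihood, since the raw product can underflow when the number of observations is large.

```python
import math


def gaussian_likelihood(xs, mu, sigma):
    """Likelihood of observations xs under a Gaussian with mean mu and std sigma."""
    n = len(xs)
    norm = (2 * math.pi * sigma ** 2) ** (-n / 2)
    # Product over all observations of exp(-(x_n - mu)^2 / (2 sigma^2)).
    return norm * math.prod(
        math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for x in xs
    )


def gaussian_log_likelihood(xs, mu, sigma):
    """Logarithm of the same likelihood, computed as a sum to avoid underflow."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - squared_error / (
        2 * sigma ** 2
    )


# Hypothetical noisy measurements of a single quantity (illustrative values).
observations = [4.8, 5.1, 5.3, 4.9, 5.2]
sigma = 0.2  # assumed known standard deviation of the measurement error
sample_mean = sum(observations) / len(observations)

for mu in (4.0, 5.0, sample_mean):
    lik = gaussian_likelihood(observations, mu, sigma)
    loglik = gaussian_log_likelihood(observations, mu, sigma)
    print(f"mu = {mu:.3f}  likelihood = {lik:.3e}  log-likelihood = {loglik:.3f}")
```

Among the candidates tried, the likelihood is largest when `mu` equals the sample mean, which is the behaviour the text describes as regressing to the mean.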





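Applying the negative logarithm to this likelihood function leads to the following equation, in which $C$ collects the terms that do not depend on $\mu$,

$$
-\log \mathcal{L} = \frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 + C
$$

Its derivative with respect to $\mu$ is shown below,

$$
\frac{\partial(-\log \mathcal{L})}{\partial \mu} = -\frac{1}{\sigma^2} \left( \sum_{n=1}^{N} x_n - N\mu \right)
$$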



Setting this derivative to zero leads to the estimated mean as shown below,

$$
\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n
$$
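The result can be checked numerically. The Python sketch below is not taken from the book: it simulates hypothetical noisy observations (the true value 5.0, the noise level 2.0 and the sample size are arbitrary assumptions), minimises the negative log-likelihood by plain gradient descent on $\mu$, and compares the result with the arithmetic mean. The function names and the step size are likewise illustrative choices.

```python
import math
import random
import statistics


def neg_log_likelihood(xs, mu, sigma):
    """Negative log-likelihood of independent Gaussian observations xs."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + squared_error / (
        2 * sigma ** 2
    )


def nll_gradient(xs, mu, sigma):
    """Derivative of the negative log-likelihood with respect to mu."""
    return -(sum(xs) - len(xs) * mu) / sigma ** 2


def fit_mean(xs, sigma, steps=200):
    """Estimate mu by gradient descent on the negative log-likelihood."""
    mu = 0.0  # arbitrary starting guess
    # Illustrative step size: each update halves the distance to the minimiser.
    learning_rate = 0.5 * sigma ** 2 / len(xs)
    for _ in range(steps):
        mu -= learning_rate * nll_gradient(xs, mu, sigma)
    return mu


# Hypothetical experiment: a true value of 5.0 observed with Gaussian noise.
random.seed(0)
true_value = 5.0
noise_sd = 2.0
xs = [random.gauss(true_value, noise_sd) for _ in range(1000)]

mu_hat = fit_mean(xs, noise_sd)
print("gradient-descent estimate:", mu_hat)
print("arithmetic mean:          ", statistics.fmean(xs))
```

Under these assumptions the two printed values agree to within floating-point precision, and both lie close to, but not exactly at, the simulated true value, because a finite sample still carries noise.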



